Models for learning reverberant environments
Reverberation is present in all real-life enclosures, from our workplaces to our homes and even in places designed as auditoria, such as concert halls and theatres. We have learned to understand speech in the presence of reverberation and also to use it aesthetically in music. This thesis investigates novel ways of enabling machines to learn the properties of reverberant acoustic environments. Training machines to classify rooms based on the effect of reverberation requires data recorded in the room. The typical measurement is the Acoustic Impulse Response (AIR) between the speaker and the receiver, expressed as a Finite Impulse Response (FIR) filter. This representation, however, is high-dimensional, and the measurements are small in number, which limits the design and performance of deep learning algorithms. Understanding the properties of rooms relies on the analysis of the reflections that compose the AIRs and of the decay and absorption of sound energy in the room. This thesis proposes novel methods for representing the early reflections, which are strong and sparse in nature and depend on the positions of the source and the receiver. The resulting representation significantly reduces the number of coefficients needed to represent the AIR and can be combined with a stochastic model from the literature to also represent the late reflections. The use of FIR representations for the task of classifying rooms is investigated, which provides novel results in this field. The analysis highlights the aforementioned issues related to AIRs, leading to the proposal of a data augmentation method for training the classifiers based on Generative Adversarial Networks (GANs), which uses existing data to create artificial AIRs as if they were measured in real rooms. The networks learn properties of the room in the space defined by the parameters of the low-dimensional representation proposed in this thesis.
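The two-part AIR structure described above (sparse early reflections plus a stochastic late tail) can be sketched as follows. This is a minimal illustration using a Polack-style exponentially decaying noise tail; all parameter values, the function name `synth_air`, and the specific envelope are assumptions for demonstration, not the thesis's actual model.

```python
import numpy as np

def synth_air(fs=16000, rt60=0.5, n_early=8, length=0.6, seed=0):
    """Toy two-part AIR: sparse early reflections + stochastic late tail.
    All parameters are illustrative, not taken from the thesis."""
    rng = np.random.default_rng(seed)
    n = int(fs * length)
    air = np.zeros(n)
    air[0] = 1.0  # direct-path component

    # Early reflections: few, strong, sparse taps whose positions depend
    # (in a real room) on source/receiver geometry.
    taps = rng.integers(int(0.002 * fs), int(0.05 * fs), size=n_early)
    air[taps] += rng.uniform(0.2, 0.8, n_early) * rng.choice([-1.0, 1.0], n_early)

    # Late reverberation: Gaussian noise under an exponential envelope;
    # exp(-3 ln(10) t / RT60) gives a 60 dB energy decay over rt60 seconds.
    t = np.arange(n) / fs
    decay = np.exp(-3.0 * np.log(10) * t / rt60)
    tail = rng.standard_normal(n) * decay
    start = int(0.05 * fs)  # tail begins after the early-reflection region
    air[start:] += 0.3 * tail[start:]
    return air

air = synth_air()
```

The point of the sparse-plus-stochastic split is that only the handful of early-reflection taps and a few decay parameters need to be stored, instead of thousands of FIR coefficients.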
End-to-End Classification of Reverberant Rooms using DNNs
Reverberation is present in our workplaces, our homes and even in places
designed as auditoria, such as concert halls and theatres. This work
investigates how deep learning can use the effect of reverberation on speech to
classify a recording in terms of the room in which it was recorded.
Approaches previously taken in the literature for the task relied on handpicked
acoustic parameters as features used by classifiers. Estimating the values of
these parameters from reverberant speech involves estimation errors, inevitably
impacting the classification accuracy. This paper shows how DNNs can perform
the classification in an end-to-end fashion, that is, by operating directly on
reverberant speech. Based on this, a method for training generalisable DNN
classifiers and a DNN architecture for the task are proposed.
A study is also made on the relationship between feature-maps derived by DNNs
and acoustic parameters that describe known properties of reverberation. In the
experiments, AIRs measured in 7 real rooms are used. The
classification accuracy of DNNs is compared between the case of having access
to the AIRs and the case of having access only to the reverberant speech
recorded in the same rooms. The experiments show that with access to the AIRs a
DNN achieves an accuracy of 99.1% and with access only to reverberant speech,
the proposed DNN achieves an accuracy of 86.9%. The experiments replicate the
testing procedure used in previous work, which relied on handpicked acoustic
parameters, allowing the direct evaluation of the benefit of using deep
learning.
Comment: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing
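The end-to-end idea above, operating on reverberant speech rather than handpicked parameters, can be illustrated with a toy front end and classifier head. The log-magnitude STFT features and the linear softmax stand-in below are illustrative assumptions; the paper's actual DNN is far deeper than this sketch.

```python
import numpy as np

def log_spectrogram(x, frame=512, hop=256):
    """Log-magnitude STFT: the kind of raw-speech input an end-to-end
    room classifier can consume directly (illustrative front end)."""
    frames = [x[i:i + frame] * np.hanning(frame)
              for i in range(0, len(x) - frame, hop)]
    spec = np.abs(np.fft.rfft(np.array(frames), axis=1))
    return np.log(spec + 1e-8)

def softmax_classify(feats, W, b):
    """Stand-in for the DNN: average-pool features over time, then a
    linear softmax layer over the 7 room classes."""
    z = feats.mean(axis=0) @ W + b
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)           # 1 s of stand-in "reverberant speech"
feats = log_spectrogram(x)               # shape: (frames, 257)
W = rng.standard_normal((257, 7)) * 0.01
p = softmax_classify(feats, W, np.zeros(7))  # posterior over 7 rooms
```

Because the features are computed from the waveform itself, no acoustic-parameter estimation step (with its attendant estimation errors) sits between the speech and the classifier.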
Discriminative feature domains for reverberant acoustic environments
Several speech processing and audio data-mining applications rely on a description of the acoustic environment as a feature vector for classification. The discriminative properties of the feature domain play a crucial role in the effectiveness of these methods. In this work, we consider three environment identification tasks and the task of acoustic model selection for speech recognition. A set of acoustic parameters and Machine Learning algorithms for feature selection are used and an analysis is performed on the resulting feature domains for each task. In our experiments, a classification accuracy of 100% is achieved for the majority of tasks and the Word Error Rate is reduced by 20.73 percentage points for Automatic Speech Recognition when using the resulting domains. Experimental results indicate a significant dissimilarity in the parameter choices for the composition of the domains, which highlights the importance of the feature selection process for individual applications.
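Feature selection of the kind described above can be sketched as a greedy forward wrapper: at each step, add the acoustic parameter that most improves a task score. The nearest-centroid scorer and the synthetic data below are assumptions for illustration; the paper's actual algorithms and parameters may differ.

```python
import numpy as np

def greedy_select(X, y, score, k):
    """Greedy forward feature selection: repeatedly add the feature
    (column of X) that maximises the task score of the selected set."""
    chosen, remaining = [], list(range(X.shape[1]))
    while len(chosen) < k and remaining:
        best = max(remaining, key=lambda j: score(X[:, chosen + [j]], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen

def centroid_accuracy(Xs, y):
    """Toy task score: nearest-class-centroid training accuracy."""
    classes = np.unique(y)
    cents = np.array([Xs[y == c].mean(axis=0) for c in classes])
    d = ((Xs[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
    return float((classes[d.argmin(axis=1)] == y).mean())

rng = np.random.default_rng(1)
y = np.repeat([0, 1], 50)                # two "environment" classes
X = rng.standard_normal((100, 6))        # six candidate acoustic parameters
X[:, 2] += 3.0 * y                       # parameter 2 is the discriminative one
picked = greedy_select(X, y, centroid_accuracy, k=2)
```

A wrapper like this makes the task-dependence of the resulting domain explicit: rerunning it with a different score (e.g. WER for acoustic model selection) generally yields a different parameter subset, matching the dissimilarity observed in the abstract.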
End-to-end classification of reverberant rooms using DNNs
Reverberation is present in our workplaces, our homes, concert halls and theatres. This paper investigates how deep learning can use the effect of reverberation on speech to classify a recording in terms of the room in which it was recorded. Existing approaches in the literature rely on domain expertise to manually select acoustic parameters as inputs to classifiers. Estimation of these parameters from reverberant speech is adversely affected by estimation errors, impacting the classification accuracy. To overcome the limitations of previously proposed methods, this paper shows how DNNs can perform the classification by operating directly on reverberant speech spectra, and a CRNN with an attention mechanism is proposed for the task. The relationship between the reverberant speech representations learned by the DNN and acoustic parameters is investigated. For evaluation, AIRs from the ACE-challenge dataset, measured in 7 real rooms, are used. The classification accuracy of the CRNN classifier in the experiments is 78% when using 5 hours of training data and 90% when using 10 hours.
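The attention mechanism mentioned above typically lets the classifier weight informative frames before pooling the recurrent layer's outputs over time. A minimal numpy sketch, with shapes and the dot-product scoring function as assumptions rather than the paper's exact architecture:

```python
import numpy as np

def attention_pool(H, w):
    """Attention pooling over time: score each frame, softmax the scores
    into weights, and return the weighted sum (a context vector)."""
    scores = H @ w                  # (T,): one relevance score per frame
    a = np.exp(scores - scores.max())
    a /= a.sum()                    # attention weights, summing to 1
    return a @ H                    # (D,): weighted average of frames

rng = np.random.default_rng(0)
H = rng.standard_normal((40, 32))   # stand-in recurrent outputs: 40 frames, 32 dims
w = rng.standard_normal(32)         # learned scoring vector (here random)
c = attention_pool(H, w)            # fixed-size vector fed to the classifier head
```

The pooled vector has a fixed dimension regardless of utterance length, which is what allows a recording of arbitrary duration to be mapped to one of the room classes.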
Analysis and Utilization of Entrainment on Acoustic and Emotion Features in User-agent Dialogue
Entrainment is the phenomenon by which an interlocutor adapts their speaking
style to align with their partner in conversations. It has been found in
different dimensions, such as acoustic, prosodic, lexical and syntactic. In this work,
we explore and utilize the entrainment phenomenon to improve spoken dialogue
systems for voice assistants. We first examine the existence of the entrainment
phenomenon in human-to-human dialogues with respect to acoustic features and then
extend the analysis to emotion features. The results show strong
evidence of entrainment in terms of both acoustic and emotion features. Based
on these findings, we implement two entrainment policies and assess whether the
integration of the entrainment principle into a Text-to-Speech (TTS) system
improves the synthesis performance and the user experience. It is found that
the integration of the entrainment principle into a TTS system brings
performance improvement when considering acoustic features, while no obvious
improvement is observed when considering emotion features.
Comment: This version has been removed by arXiv administrators because the submitter did not have the right to assign a license at the time of submission
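One simple form an entrainment policy can take is moving the agent's acoustic parameters part-way toward those measured from the user's last utterance. The convex-combination rule, the `alpha` value, and the parameter set below are assumptions for illustration, not the paper's actual policies.

```python
def entrain(agent, user, alpha=0.5):
    """Toy entrainment policy: shift each of the agent's acoustic
    parameters a fraction alpha of the way toward the user's values."""
    return {k: agent[k] + alpha * (user[k] - agent[k]) for k in agent}

# Hypothetical TTS settings and measurements from the user's last turn.
tts_params = {"pitch_hz": 200.0, "rate_wpm": 170.0, "energy_db": -20.0}
user_params = {"pitch_hz": 150.0, "rate_wpm": 190.0, "energy_db": -23.0}
next_params = entrain(tts_params, user_params)  # applied to the next synthesised turn
```

With alpha = 0 the agent ignores the user entirely, and with alpha = 1 it mirrors them exactly; intermediate values give the gradual convergence that entrainment studies report.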